
    An Unsupervised Algorithm for Segmenting Categorical Timeseries into Episodes

    This paper describes an unsupervised algorithm for segmenting categorical time series into episodes. The Voting-Experts algorithm first collects statistics about the frequency and boundary entropy of n-grams, then passes a window over the series and has two “expert methods” decide where boundaries should be drawn within the window. The algorithm successfully segments text into words in four languages. It also segments time series of robot sensor data into subsequences that represent episodes in the life of the robot. We claim that Voting-Experts finds meaningful episodes in categorical time series because it exploits two statistical characteristics of meaningful episodes.
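
    As a rough illustration of the idea described above, the sketch below implements a simplified Voting-Experts-style pass in Python: it gathers n-gram frequency and boundary-entropy statistics, slides a window over the sequence, and lets a frequency expert and an entropy expert each vote for one cut point per window. It omits details of the published algorithm (such as score normalization), and the function names are ours, not the authors'.

```python
# Simplified sketch of a Voting-Experts-style segmentation pass (illustrative only).
from collections import Counter, defaultdict
from math import log2

def ngram_stats(seq, max_n):
    """Collect frequency and successor counts for all n-grams up to length max_n."""
    freq, successors = Counter(), defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            freq[gram] += 1
            if i + n < len(seq):
                successors[gram][seq[i + n]] += 1
    return freq, successors

def boundary_entropy(gram, successors):
    """Entropy of the symbol following `gram`; high entropy suggests a boundary."""
    counts = successors[gram]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts.values())

def vote(seq, window=7):
    """Slide a window over `seq`; two experts each vote for one cut per window."""
    freq, successors = ngram_stats(seq, window)
    votes = [0] * (len(seq) + 1)
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        # Frequency expert: favour the cut that leaves two frequent chunks.
        best_f = max(range(1, window),
                     key=lambda k: freq[tuple(w[:k])] + freq[tuple(w[k:])])
        # Boundary-entropy expert: favour the cut after a high-entropy prefix.
        best_e = max(range(1, window),
                     key=lambda k: boundary_entropy(tuple(w[:k]), successors))
        votes[start + best_f] += 1
        votes[start + best_e] += 1
    return votes  # peaks above a chosen threshold are taken as episode boundaries
```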

    Tag based models of English text

    The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the part of speech of each word in the text (referred to as “tags”). It is shown that the tags plus the text can be compressed more than the text alone: essentially, the tags can be compressed for nothing, or even with a small net saving in size. A number of different ways of integrating the compression of tags and text using an escape mechanism similar to PPM are compared, along with standard word-based and character-based compression programs. The result is that the tag-character and word-based schemes always outperform the character-based schemes, and overall the tag-based schemes outperform the word-based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.
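
    A toy unigram illustration (not the paper's PPM scheme) of why tags can come essentially for free: when the tags are largely predictable from the words, the cost of coding the tag stream and then the words given the tags is close to the cost of coding the words alone. The corpus and tag set below are invented for the example.

```python
# Toy comparison of coding costs: words alone versus tags-then-words (illustrative only).
from collections import Counter, defaultdict
from math import log2

def bits(count, total):
    """Ideal code length, in bits, for an event with empirical probability count/total."""
    return -log2(count / total)

def compare_costs(tagged_corpus):
    """Return (word-only, tag-then-word) coding costs in bits for a list of (word, tag) pairs."""
    words = Counter(w for w, _ in tagged_corpus)
    tags = Counter(t for _, t in tagged_corpus)
    word_given_tag = defaultdict(Counter)
    for w, t in tagged_corpus:
        word_given_tag[t][w] += 1

    n = len(tagged_corpus)
    word_only = sum(bits(words[w], n) for w, _ in tagged_corpus)
    tag_then_word = sum(bits(tags[t], n) + bits(word_given_tag[t][w], tags[t])
                        for w, t in tagged_corpus)
    return word_only, tag_then_word

corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
print(compare_costs(corpus))
# Both costs are equal here: each word has a unique tag, so coding the tags and
# the words together costs no more than coding the words alone.
```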

    Grammatical Herding

    No full text

    Correcting English text using PPM models

    An essential component of many applications in natural language processing is a language model able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized, while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or omitted. This paper describes a method for correcting English text using a PPM model. A method for segmenting words in English text is introduced and shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 95.9% to 96.6%, a reduction of about 10 errors per page.
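
    The sketch below shows the general shape of language-model-based correction, with a character bigram model standing in for PPM and a single-edit candidate generator standing in for the paper's error model; none of it reproduces the authors' implementation.

```python
# Sketch of model-based correction: score single-edit variants of a noisy word
# with a character bigram model (a crude stand-in for PPM) and keep the cheapest.
from collections import Counter, defaultdict
from math import log2

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train_bigrams(text):
    """Character bigram and unigram counts over training text."""
    bigrams, unigrams = defaultdict(Counter), Counter(text)
    for a, b in zip(text, text[1:]):
        bigrams[a][b] += 1
    return bigrams, unigrams

def cost(word, bigrams, unigrams):
    """Negative log-probability of `word` under the bigram model, with add-one smoothing."""
    total = 0.0
    for a, b in zip(word, word[1:]):
        total -= log2((bigrams[a][b] + 1) / (unigrams[a] + len(ALPHABET)))
    return total

def candidates(word):
    """Single-edit variants: deletions, transpositions, substitutions, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    out = {word}
    out |= {l + r[1:] for l, r in splits if r}                          # deletion
    out |= {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}   # transposition
    out |= {l + c + r[1:] for l, r in splits if r for c in ALPHABET}    # substitution
    out |= {l + c + r for l, r in splits for c in ALPHABET}             # insertion
    return out

def correct(word, bigrams, unigrams):
    """Return the candidate the character model finds most probable."""
    return min(candidates(word), key=lambda w: cost(w, bigrams, unigrams))
```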

    Adaptive models of English text

    High-quality models of English text with performance approaching that of humans are important for many applications, including spelling correction, speech recognition, OCR, and encryption. A number of different statistical models of English are compared with each other and with previous estimates from human subjects. It is concluded that the best current models are word-based with part-of-speech tags; given sufficient training text, they are able to attain performance comparable to that of humans.

    Constituent Grammatical Evolution

    No full text

    Universal text preprocessing and postprocessing for PPM using Alphabet Adjustment

    No full text
    In this paper, we introduce several new universal preprocessing techniques to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially 'adjust' the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text.
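
    One plausible form of 'alphabet expansion' for UTF-8 text is sketched below: each distinct character is remapped to a single integer symbol before modelling, and the mapping is inverted after decompression. This is an assumption about what such an adjustment might look like, not a reproduction of the paper's transforms.

```python
# Sketch of an "alphabet expansion" preprocessing step for UTF-8 text (illustrative only).
# Multi-byte UTF-8 characters are remapped to single integer symbols so a
# symbol-oriented model such as PPM sees one symbol per character.

def expand_alphabet(text):
    """Map each distinct character to a small integer symbol."""
    symbol_of = {}
    symbols = []
    for ch in text:
        if ch not in symbol_of:
            symbol_of[ch] = len(symbol_of)
        symbols.append(symbol_of[ch])
    # The table must accompany the compressed data so the decoder can invert the mapping.
    char_of = {s: c for c, s in symbol_of.items()}
    return symbols, char_of

def restore_alphabet(symbols, char_of):
    """Inverse transform applied after decompression."""
    return "".join(char_of[s] for s in symbols)

text = "çağdaş ölçüm"              # characters that take several bytes in UTF-8
symbols, table = expand_alphabet(text)
assert restore_alphabet(symbols, table) == text
```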

    A 300 MB Turkish Corpus and Word Analysis

    No full text
    In order to determine some properties of a language, a corpus of that language must first be created. To analyze Turkish, a corpus of roughly 300 MB containing more than 44 million words was compiled from 10 different web sites with Turkish content. Statistics on the most frequently used Turkish words were calculated from this corpus. The frequencies of the seven most frequently used words were compared with their English equivalents, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of one to five letters were then determined and applied to a randomly selected text in order to test the validity of the process.
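
    The word-frequency analysis described above amounts to tokenizing the corpus and counting tokens; a minimal sketch follows (the file name and tokenization rule are illustrative, and Turkish-specific case folding is ignored).

```python
# Minimal sketch of the corpus word-frequency analysis described above
# (file name and tokenization are illustrative, not taken from the paper).
import re
from collections import Counter

def word_frequencies(path, encoding="utf-8"):
    """Count word tokens in a corpus file, streaming line by line."""
    counts = Counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            # Simple lowercasing; proper Turkish case folding (dotted/dotless i) is ignored.
            counts.update(re.findall(r"\w+", line.lower()))
    return counts

freqs = word_frequencies("turkish_corpus.txt")   # hypothetical file name
for word, count in freqs.most_common(7):         # the seven most frequent words
    print(word, count)
```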